Data Scientist
1000+ Data Scientist Interview Questions and Answers

Asked in Intellect Design Arena

Q. For data with 1000 samples and 700 dimensions, how would you find a line that best fits the data, to be able to extrapolate? This is not a supervised ML problem; there is no target. And how would you do it, if...
To find a line that best fits data with 1000 samples and 700 dimensions when there is no target, use Principal Component Analysis (PCA); ordinary linear regression needs a target column.
For the unsupervised approach, center the data and take the first principal component. The line through the mean along that direction minimizes the total squared perpendicular distance to the points, and moving along it extrapolates beyond the observed range.
For a supervised approach, we need to select a target column. We can choose any of the 700 dimensions as the target and treat it as a regression problem.
Potential problems of treating this as a supervised problem include: lack of in...
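The PCA-based approach can be sketched with NumPy. This is a minimal illustration on synthetic data, not the interviewer's expected answer; the function names best_fit_line and point_on_line and the generated dataset are assumptions made for the example.

```python
import numpy as np

def best_fit_line(X):
    """Return (mean, direction) of the line minimizing squared perpendicular distances."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the first right singular vector is the
    # first principal component (direction of maximum variance).
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[0]

def point_on_line(mean, direction, t):
    """Extrapolate: the point at signed distance t from the mean along the line."""
    return mean + t * direction

# Synthetic stand-in for the 1000 x 700 dataset: points scattered near a known line.
rng = np.random.default_rng(0)
true_dir = rng.normal(size=700)
true_dir /= np.linalg.norm(true_dir)
t = rng.uniform(-10, 10, size=1000)
X = np.outer(t, true_dir) + 0.01 * rng.normal(size=(1000, 700))

mean, direction = best_fit_line(X)
far_point = point_on_line(mean, direction, 25.0)  # beyond the sampled range
```

The sign of the returned direction is arbitrary, so compare it with a known direction through the absolute value of the dot product.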

Asked in Bajaj Finserv

Q. Special Sum of Array Problem Statement
Given an array 'arr' containing single-digit integers, your task is to calculate the total sum of all its elements. However, the resulting sum must also be a single-digit ...
Calculate the total sum of array elements until a single-digit number is obtained by repeatedly summing digits.
Iterate through the array and calculate the sum of all elements.
If the sum is a single-digit number, return it. Otherwise, repeat the process of summing digits until a single-digit number is obtained.
Return the final single-digit sum.
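The steps above can be sketched as follows. The function name special_sum is an assumption, and the sketch assumes the array holds non-negative digits.

```python
def special_sum(arr):
    """Sum the elements, then keep summing the digits until one digit remains."""
    total = sum(arr)
    while total >= 10:
        # Replace the total with the sum of its decimal digits.
        total = sum(int(digit) for digit in str(total))
    return total

example = special_sum([9, 9, 9])  # 27, then 2 + 7 = 9
```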

Asked in Affine

Q. You have a pandas dataframe with three columns filled with state names, city names, and arbitrary numbers, respectively. How do you retrieve the top two cities per state based on the maximum number in the third...
Retrieve the top 2 cities per state based on the maximum number in the third column of a pandas dataframe.
Sort the dataframe by the third column in descending order.
Group the sorted dataframe by the state column.
Retrieve the top 2 rows of each group using the head(2) function.
groupby(...).head(2) already returns a single dataframe, so pd.concat() is only needed if each group is processed separately.
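One way to sketch this in pandas is to sort once before grouping, so the per-group results do not need to be concatenated by hand. The column names state, city and value and the sample rows are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["KA", "KA", "KA", "MH", "MH", "MH"],
    "city": ["Bengaluru", "Mysuru", "Hubli", "Mumbai", "Pune", "Nagpur"],
    "value": [50, 30, 40, 90, 70, 20],
})

top2 = (
    df.sort_values("value", ascending=False)   # largest values first
      .groupby("state")
      .head(2)                                  # first two rows of each state
      .sort_values(["state", "value"], ascending=[True, False])
      .reset_index(drop=True)
)
```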

Asked in Walmart

Q. Describe the data you would analyze to solve cost and revenue optimization case studies. How would you formulate the objective functions?
Answering a question on data and objective function for cost and revenue optimization case studies.
For cost optimization, look at data related to expenses, production costs, and resource allocation.
For revenue optimization, look at data related to sales, customer behavior, and market trends.
Objective function for cost optimization could be minimizing expenses while maintaining quality.
Objective function for revenue optimization could be maximizing profits while satisfying cus...

Asked in Amazon

Q. Clone a Linked List with Random Pointers
Given a linked list where each node contains two pointers: one pointing to the next node and another random pointer that can point to any node within the list (or be nul...
Create a deep copy of a linked list with random pointers.
Iterate through the original linked list and create a new node for each node in the list.
Store the mapping of original nodes to new nodes in a hashmap to handle random pointers.
Update the random pointers of new nodes based on the mapping stored in the hashmap.
Return the head of the copied linked list.
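The hashmap approach described above can be sketched in two passes. The Node class and the function name clone_list are assumptions made for the example.

```python
class Node:
    def __init__(self, val, next=None, random=None):
        self.val = val
        self.next = next
        self.random = random

def clone_list(head):
    """Deep-copy a linked list whose nodes carry next and random pointers."""
    if head is None:
        return None
    # Pass 1: create a copy of every node and remember original -> copy.
    copies = {}
    node = head
    while node:
        copies[node] = Node(node.val)
        node = node.next
    # Pass 2: wire next/random through the map; copies.get(None) returns None.
    node = head
    while node:
        copies[node].next = copies.get(node.next)
        copies[node].random = copies.get(node.random)
        node = node.next
    return copies[head]

# Example list 1 -> 2 -> 3 with random pointers 1 -> 3 and 2 -> 1.
c = Node(3)
b = Node(2, c)
a = Node(1, b)
a.random, b.random = c, a
copy_head = clone_list(a)
```

This uses O(n) extra space for the map; an interleaving technique can avoid the map at the cost of temporarily modifying the original list.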

Asked in Coforge

Q. Given a list of numbers, find the indices of two numbers that add up to a specific target value. Do this without using nested for loops. For example, given the list l = [2, 15, 5, 7] and target t = 9, the outpu...
Find the indices of two numbers whose total equals the target in a list, without a nested for loop.
Use a dictionary that maps the difference between the target and each element seen so far to that element's index.
Iterate through the list once and check whether the current element is already a key in the dictionary.
If it is, return the stored index and the current index; these two elements add up to the target.
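The dictionary approach can be sketched as a single pass over the list. The function name two_sum is an assumption, and returning None when no pair exists is a choice made for the example.

```python
def two_sum(nums, target):
    """Return indices of two elements that add up to target, or None."""
    needed = {}  # value still needed -> index of the element that needs it
    for i, x in enumerate(nums):
        if x in needed:
            return [needed[x], i]
        needed[target - x] = i
    return None

result = two_sum([2, 15, 5, 7], 9)  # 2 + 7 = 9
```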

Asked in EXL Service

Q. How would you measure model effectiveness without using any confusion matrix metrics, given the data is highly imbalanced?
One way to measure model effectiveness without relying on a single confusion matrix is the area under the receiver operating characteristic curve (AUC-ROC).
Calculate the AUC-ROC score to evaluate the model's ability to distinguish between positive and negative classes.
AUC-ROC considers the entire range of classification thresholds; it equals the probability that a randomly chosen positive is scored above a randomly chosen negative.
Its value does not depend on the class ratio, but with very rare positives a high AUC-ROC can still hide a large number of false positives.
A higher AUC-ROC score indicates better model performance.
Example: A model with an AUC-ROC score of 0.85 perform...
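AUC-ROC can be computed directly as the fraction of (positive, negative) pairs that the model ranks correctly. The sketch below is a plain-Python illustration on made-up labels and scores; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def auc_roc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Imbalanced toy data: 8 negatives, 2 positives (values are illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.35, 0.7, 0.6, 0.9]
auc = auc_roc(y_true, scores)  # 15 of 16 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for a demonstration but not for large datasets.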

Asked in Intellect Design Arena

Q. What is tokenization in NLP? And, to get raw tokens for a sentence with words separated by spaces, why use tokenizers from NLTK instead of str.split()?
Tokenization in NLP is the process of breaking down text into smaller units called tokens.
Tokenization is a fundamental step in NLP for text preprocessing.
Tokens can be words, phrases, or even individual characters.
Tokenization helps in preparing text data for further analysis or modeling.
NLTK tokenizers provide additional functionalities like handling contractions, punctuation, etc.
str.split() may not handle complex tokenization scenarios as effectively as NLTK tokenizers.
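The difference shows up as soon as punctuation is attached to words. The sketch below contrasts str.split() with a simple regular expression that separates punctuation; the regex is only a stand-in used for illustration, not NLTK's algorithm, and NLTK tokenizers go further (contractions, abbreviations and so on).

```python
import re

sentence = "Hello, world! Tokenizers are useful."

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = sentence.split()

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

With str.split(), "Hello," and "Hello" would count as different tokens, which inflates the vocabulary.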
Asked in GeekBull Consulting

Q. You have two different vectors with only a small change in one of the dimensions, but the predictions/output from the model are drastically different for each vector. Can you explain why this can be the case? I...
A small change in one input dimension can produce a drastically different output when the model is highly sensitive to that input.
This is known as sensitivity to input.
It can be caused by non-linearities in the model (for example, an input close to a decision boundary or a tree split threshold) or by overfitting.
Regularization techniques can be used to reduce sensitivity.
Cross-validation can help identify overfitting.
Ensemble methods can help reduce sensitivity.
It is generally a bad thing, as it indicates instability in the model.

Asked in ExxonMobil

Q. In which direction does fluid flow in a vertical pipe when the pressures at two vertical locations are given?
The direction of fluid flow in a vertical pipe depends on the total (piezometric) head at the two locations, not on pressure alone, because elevation also contributes.
For a pipe of constant diameter, compare p/(ρg) + z at the two points; fluid flows from the higher value to the lower one.
If the pressure at the lower location exceeds the pressure at the upper location by more than the hydrostatic difference ρgh, the fluid flows upwards.
If it exceeds it by less than ρgh, or if the pressure at the upper location is higher, the fluid flows downwards.
If the difference equals ρgh exactly, the fluid is static; the size of the head difference determines the rate of fluid flow.
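A short numeric check makes the role of elevation concrete: in a vertical pipe the comparison is between p/(ρg) + z at the two points, so a higher pressure at the bottom does not by itself mean upward flow. The sketch assumes water, a constant pipe diameter (so velocity heads cancel), and a hypothetical function name flow_direction.

```python
RHO_WATER = 1000.0  # kg/m^3, assumed fluid density
G = 9.81            # m/s^2

def flow_direction(p_lower, p_upper, height, rho=RHO_WATER, g=G):
    """Pressures in Pa; height is the elevation of the upper point above the lower, in m."""
    head_lower = p_lower / (rho * g)            # elevation datum at the lower point
    head_upper = p_upper / (rho * g) + height
    if head_lower > head_upper:
        return "upward"
    if head_lower < head_upper:
        return "downward"
    return "no flow"

# 150 kPa at the bottom, 100 kPa at a point 10 m higher:
# heads are about 15.3 m and 20.2 m, so the flow is downward.
direction_a = flow_direction(150e3, 100e3, 10.0)
# 250 kPa at the bottom overcomes the 10 m of elevation: upward flow.
direction_b = flow_direction(250e3, 100e3, 10.0)
```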

Asked in Walmart

Q. How can you tune the hyperparameters of XGBoost, Random Forest, and SVM algorithms?
Hyperparameters of XGBoost, Random Forest, and SVM can be tuned using techniques like grid search, random search, and Bayesian optimization.
For XGBoost, important hyperparameters to tune include learning rate, maximum depth, and number of estimators.
For Random Forest, important hyperparameters to tune include number of trees, maximum depth, and minimum samples split.
For SVM, important hyperparameters to tune include kernel type, regularization parameter, and gamma value.
Grid ...

Asked in Intellect Design Arena

Q. When tokenizing, if you want to avoid breaking up specific word pairs (or triplets), for example, to not tokenize the words 'first' and 'name' when they occur together and consider them as a single token, how w...
Use NLTK's MWETokenizer to preserve specific word pairs or triplets during tokenization.
MWETokenizer allows you to define multi-word expressions (MWEs) that should be treated as single tokens.
Example: If you define the MWE ('first', 'name'), the tokenizer merges the two words into one token ('first_name' with the default separator).
You can add multiple MWEs, such as ('New', 'York') and ('data', 'science').
This is particularly useful in NLP tasks where context matters, like sentiment analysis.
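A minimal usage sketch of MWETokenizer follows. It takes an already tokenized list of words, so the sentence is split on whitespace first; the example sentence is made up.

```python
from nltk.tokenize import MWETokenizer

# Multi-word expressions are given as tuples of their component words.
tokenizer = MWETokenizer([("first", "name"), ("New", "York")], separator="_")
tokenizer.add_mwe(("data", "science"))

words = "my first name is on the New York data science list".split()
tokens = tokenizer.tokenize(words)
```

Matching is exact and case-sensitive, so ('New', 'York') does not merge 'new york'.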
Asked in NeenOpal Intelligent Solutions

Q. What is the difference between a list and a tuple? Given a = [1,2,3,4,5,6,7,8,9], what does print(a[-1:-5]) output? Answer without running the code.
The code outputs an empty list, [], as a result of slicing from -1 to -5 in the list 'a'.
A list is mutable, while a tuple is immutable; both support indexing and slicing.
Slicing in Python allows you to access a subset of elements in a list or tuple.
When slicing, the start index is inclusive and the end index is exclusive.
In this case, a[-1:-5] results in an empty list because the default step is +1 and the start position (-1, the last element) lies after the end position (-5), so there is nothing to traverse.
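The slicing behaviour is easy to verify, along with the two variants that do return elements. The variable names are chosen for the example.

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]

empty = a[-1:-5]         # step +1, start is already past the end: []
backwards = a[-1:-5:-1]  # step -1 walks from the last element: [9, 8, 7, 6]
forwards = a[-5:-1]      # start before end with step +1: [5, 6, 7, 8]

# Tuples support the same slicing but reject item assignment.
t = (1, 2, 3)
try:
    t[0] = 99
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
```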

Asked in Affine

Q. How do you retain special characters (that pandas discards by default) in the data while reading it?
To retain special characters, pass the file's actual encoding through the encoding parameter when reading the data in pandas.
Specify the encoding the file was saved in (for example 'utf-8' or 'latin-1'); a mismatched encoding garbles non-ASCII characters or raises a decoding error.
Example: pd.read_csv('filename.csv', encoding='utf-8')

Asked in Rolls-Royce

Q. What are the types of ML algorithms? Give an example of each.
There are several types of ML algorithms, including supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning: algorithms learn from labeled data to make predictions or classifications (e.g., linear regression, decision trees)
Unsupervised learning: algorithms find patterns or relationships in unlabeled data (e.g., clustering, dimensionality reduction)
Reinforcement learning: algorithms learn through trial and error by interacting with an enviro...
Asked in NeenOpal Intelligent Solutions

Q. Given sample data in text format, how would you read it into Python, check for null and unique values, and create a new column by multiplying two existing features?
Read sample data in text, check for null and unique values, create new column by multiplying two features
Save text data as CSV and read in Python using pandas
Use isnull() to check for null values
Use nunique() to check for unique values
Create a new column by multiplying two existing columns
Add the new column to the existing dataframe
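The steps above can be sketched in pandas. The text is read from an in-memory string here so the example is self-contained; the column names item, price and quantity and the sample rows are assumptions.

```python
import io
import pandas as pd

raw_text = """item,price,quantity
pen,10.0,3
book,,2
pen,10.0,5
"""

# read_csv accepts any file-like object; sep can be changed for other delimiters.
df = pd.read_csv(io.StringIO(raw_text))

null_counts = df.isnull().sum()   # missing values per column
unique_counts = df.nunique()      # distinct non-null values per column

# New column as the product of two existing columns (NaN propagates).
df["total"] = df["price"] * df["quantity"]
```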

Asked in Chubb

Q. how will you get the embeddings of long sentences/paragraphs that transformer models like BERT truncate? how will you go about using BERT for such sentences? will you use sentence embeddings or word embeddings...
To embed text longer than BERT's 512-token limit, split it into chunks that fit, embed each chunk, and combine the chunk embeddings.
Pooling techniques like mean or max pooling can combine the per-chunk embeddings into a single vector for the whole text.
A sliding window approach produces overlapping segments of the long input, so that context at chunk boundaries is not lost.
For using BERT on such long inputs, we can use sentence embeddings or word embeddings depending on the task.
Models like Longformer and Reformer can handle lon...
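The sliding window and pooling steps can be sketched without any model. Here token_ids stands for the output of a tokenizer and each window would be passed to BERT separately (not shown); the function names, the stride value and the toy vectors are assumptions.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split token_ids into overlapping windows of at most max_len tokens."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows

def mean_pool(vectors):
    """Average equal-length vectors element-wise (one vector per window)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

windows = sliding_windows(list(range(10)), max_len=4, stride=2)
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # toy stand-ins for window embeddings
```

With a real model, max_len needs to leave room for the special tokens that BERT adds to every input.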

Asked in Feynn Labs

Q. What is the difference between Linear Regression and Logistic Regression?
Linear Regression is used for predicting continuous numerical values, while Logistic Regression is used for predicting binary categorical values.
Linear Regression predicts a continuous output, while Logistic Regression predicts a binary output.
Linear Regression uses a linear equation to model the relationship between the independent and dependent variables, while Logistic Regression uses a logistic function.
Linear Regression assumes a linear relationship between the variables...

Asked in Walmart

Hyperparameters of XGBoost can be tuned using techniques like grid search, random search, and Bayesian optimization.
Use grid search to exhaustively search through a specified parameter grid
Utilize random search to randomly sample hyperparameters from a specified distribution
Apply Bayesian optimization to sequentially choose hyperparameters based on the outcomes of previous iterations
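Grid search and random search can be sketched in plain Python. The validation_score function below is a hypothetical stand-in for training a model and returning its validation score, and the parameter ranges are illustrative; with real models, tools such as scikit-learn's GridSearchCV and RandomizedSearchCV handle the cross-validation as well.

```python
import itertools
import random

def validation_score(learning_rate, max_depth):
    """Hypothetical objective: peaks at learning_rate=0.1, max_depth=5."""
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 5) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and return the best parameters."""
    names = list(grid)
    best_score, best_params = None, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_params

def random_search(n_trials, seed=0):
    """Sample parameters from distributions instead of a fixed grid."""
    rng = random.Random(seed)
    best_score, best_params = None, None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "max_depth": rng.randint(2, 10),
        }
        score = validation_score(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_params

grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 5, 7]}
best_grid = grid_search(grid)
best_random = random_search(n_trials=20)
```

Grid search costs one evaluation per combination, so it grows quickly with the number of parameters, whereas random search keeps a fixed budget.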

Asked in Accenture

Q. Why do we use machine learning? Machine learning is used to analyze data and make predictions; combined with additional algorithms, it is mainly used for prediction and AI.
Machine learning is used for data analysis and prediction, and it underpins many AI applications.
Machine learning is a subset of artificial intelligence that focuses on predicting outcomes by learning from data.
It involves using algorithms to learn patterns from existing data and make predictions on new data.
Examples include image recognition, natural language processing, and recommendation systems.

Asked in AB InBev India

Q. How did you prevent your model from overfitting? What did you do when it was underfit?
To prevent overfitting, I used techniques like regularization, cross-validation, and early stopping. For underfitting, I tried increasing model complexity and adding more features.
Used regularization techniques like L1 and L2 regularization to penalize large weights
Used cross-validation to evaluate model performance on different subsets of data
Used early stopping to prevent the model from continuing to train when performance on validation set stops improving
For underfitting, ...

Asked in Turing

Q. What is the neighborhood in which superhosts have the biggest median price difference with respect to non-superhosts?
The neighbourhood with the biggest median price difference between superhosts and non-superhosts is X.
Calculate the median price for superhosts and non-superhosts in each neighbourhood.
Find the neighbourhood with the largest difference in median prices between superhosts and non-superhosts.
Example: Neighbourhood X has a median price of $200 for superhosts and $150 for non-superhosts, resulting in a $50 difference.
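The calculation can be sketched in pandas as a group-by followed by a pivot. The listings table, its column names (neighbourhood, host_type, price) and the values are made up for the example and do not come from the original dataset.

```python
import pandas as pd

listings = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "host_type": ["superhost", "superhost", "regular", "regular",
                  "superhost", "superhost", "regular", "regular"],
    "price": [200, 220, 150, 170, 120, 140, 110, 130],
})

# Median price per neighbourhood and host type, one column per host type.
medians = (
    listings.groupby(["neighbourhood", "host_type"])["price"]
            .median()
            .unstack()
)

# Absolute gap between the two medians, then the neighbourhood with the largest gap.
gap = (medians["superhost"] - medians["regular"]).abs()
top_neighbourhood = gap.idxmax()
```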

Asked in Affine

Q. How would you approach finding the number of white cars in a city?
Estimate the number of white cars using surveys, traffic data, and image recognition techniques.
Conduct surveys: Ask residents about car colors in their neighborhoods.
Use traffic cameras: Analyze footage to count white cars during peak hours.
Leverage social media: Analyze posts or images of cars in the city.
Utilize machine learning: Train a model on images of cars to identify white ones.
Collaborate with local authorities: Access registration data for car colors.

Asked in Citicorp

Q. Which test is used in logistic regression to check the significance of the variable?
The Wald test is used in logistic regression to check the significance of the variable.
The Wald test calculates the ratio of the estimated coefficient to its standard error.
Under the null hypothesis this ratio (the z statistic) is approximately standard normal; its square follows a chi-square distribution with one degree of freedom.
A small p-value indicates that the variable is significant.
For example, in Python, the statsmodels library reports the z statistic and p-value for each coefficient in the summary of a logistic regression model.
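The Wald statistic for a single coefficient is simple enough to compute by hand: z is the coefficient divided by its standard error, z is approximately standard normal under the null hypothesis, and z squared is chi-square with one degree of freedom. The function name and the example numbers are assumptions.

```python
import math

def wald_test(coef, std_err):
    """Return (z, chi_square, two_sided_p) for H0: coefficient = 0."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, z * z, p_value

# Illustrative values: coefficient 1.2 with standard error 0.4.
z, chi_square, p_value = wald_test(1.2, 0.4)
```

Here z is about 3 and the p-value is about 0.0027, so the variable would be judged significant at the 5% level.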

Asked in Nielsen

Q. Write pandas query to separate the names as first and last name from the full name. Drop the duplicate columns and also the missing values. Write output for the Python code. Write SQL query to retrieve the name...
Answering questions related to data science concepts and techniques.
Recall is the ratio of correctly predicted positive observations to the total actual positives. Precision is the ratio of correctly predicted positive observations to the total predicted positives.
To reduce variance in an ensemble model, techniques like bagging, boosting, and stacking can be used. Bagging involves training multiple models on different subsets of the data and averaging their predictions. Boost...

Asked in Walmart

Q. What do these hyperparameters in the above-mentioned algorithms actually mean?
Hyperparameters are settings that control the behavior of machine learning algorithms.
Hyperparameters are set before training the model.
They control the learning process and affect the model's performance.
Examples include learning rate, regularization strength, and number of hidden layers.
Optimizing hyperparameters is important for achieving better model accuracy.

Asked in MasterCard

Q. How do you deal with senior customers when you don't have enough data?
Communicate transparently and offer alternative solutions.
Explain the limitations of the available data and the potential risks of making decisions based on incomplete information.
Offer alternative solutions that can be implemented with the available data.
Collaborate with the customer to identify additional data sources or explore other options to gather more data.
Provide regular updates on the progress of data collection and analysis.
Ensure that all decisions are based on so...

Asked in Walmart

Hyperparameters in XGBoost algorithm control the behavior of the model during training.
Hyperparameters include parameters like learning rate, max depth, number of trees, etc.
They are set before the training process and can greatly impact the model's performance.
Example: 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100

Asked in Affine

Q. How will the resultant table be when you merge two tables that match on a column, and the second table has many repeated keys?
The resultant table has the columns of both tables, and each row of the first table is paired with every row of the second table that has the same key.
The key column appears once; the remaining columns from both tables are kept.
If the second table has repeated keys, each matching row from the first table is repeated once per matching row in the second table, so the result can have more rows than the first table.
With an inner join (the pandas default), rows whose key has no match in the other table are dropped.
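A small pandas example shows how repeated keys in the second table multiply rows. The table contents and column names are made up for the illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "right_val": [10, 20, 30, 40]})

# Inner join (the default): 'a' matches three rows, 'b' one, 'c' none.
inner = left.merge(right, on="key")

# Left join keeps 'c' as well, with a missing right_val.
left_join = left.merge(right, on="key", how="left")
```

The inner result has 4 rows and the left join 5, although the first table has only 3. Passing validate="one_to_one" to merge raises an error when such duplicates are unexpected.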

Asked in Axtria

Ridge and LASSO regression are both regularization techniques used in linear regression to prevent overfitting by adding penalty terms to the cost function.
Ridge regression adds a penalty term equivalent to the square of the magnitude of coefficients (L2 regularization).
LASSO regression adds a penalty term equivalent to the absolute value of the magnitude of coefficients (L1 regularization).
Ridge regression tends to shrink the coefficients towards zero but does not set them e...